GATK4 Germline Variant Caller
GATK4_GermlineVariantCaller
· 2 contributors · 2 versions
This is a VariantCaller based on the GATK Best Practice pipelines. It uses the GATK4 toolkit, specifically 4.1.3.
It has the following steps:
- Split Bam based on intervals (bed)
- HaplotypeCaller
- SplitMultiAllele
Quickstart
from janis_core import WorkflowBuilder
from janis_bioinformatics.tools.variantcallers.gatk.gatkgermline_variants_4_1_3 import GatkGermlineVariantCaller_4_1_3

wf = WorkflowBuilder("myworkflow")

# Replace each None with a workflow input or an upstream step output.
wf.step(
    "gatk4_germlinevariantcaller_step",
    GatkGermlineVariantCaller_4_1_3(
        bam=None,
        reference=None,
        snps_dbsnp=None,
    )
)
# Step outputs are referenced through the workflow object.
wf.output("variants", source=wf.gatk4_germlinevariantcaller_step.variants)
wf.output("out_bam", source=wf.gatk4_germlinevariantcaller_step.out_bam)
wf.output("out", source=wf.gatk4_germlinevariantcaller_step.out)
OR
- Install Janis
- Ensure Janis is configured to work with Docker or Singularity.
- Ensure all reference files are available:
Note
More information about these inputs is available below.
- Generate user input files for GATK4_GermlineVariantCaller:
# user inputs
janis inputs GATK4_GermlineVariantCaller > inputs.yaml
inputs.yaml
bam: bam.bam
reference: reference.fasta
snps_dbsnp: snps_dbsnp.vcf.gz
- Run GATK4_GermlineVariantCaller with:
janis run [...run options] \
--inputs inputs.yaml \
GATK4_GermlineVariantCaller
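The command-line steps above can also be scripted. The following is a minimal sketch, not part of Janis: the helper names `render_inputs_yaml` and `janis_run_command` are hypothetical, and the YAML rendering assumes plain paths that need no quoting. It only reproduces the `inputs.yaml` layout and the `janis run` invocation shown in this quickstart.

```python
import shlex

# The three required inputs named in this page's inputs.yaml example.
REQUIRED_INPUTS = ("bam", "reference", "snps_dbsnp")


def render_inputs_yaml(inputs):
    """Render a flat mapping of input name -> path as simple YAML text.

    Assumption: values are plain paths without characters that YAML
    would require quoting for.
    """
    missing = [name for name in REQUIRED_INPUTS if not inputs.get(name)]
    if missing:
        raise ValueError(f"missing required inputs: {', '.join(missing)}")
    return "".join(f"{key}: {value}\n" for key, value in inputs.items())


def janis_run_command(inputs_path, run_options=()):
    """Build the `janis run` argument list shown in the quickstart."""
    return [
        "janis", "run", *run_options,
        "--inputs", inputs_path,
        "GATK4_GermlineVariantCaller",
    ]


if __name__ == "__main__":
    text = render_inputs_yaml({
        "bam": "bam.bam",
        "reference": "reference.fasta",
        "snps_dbsnp": "snps_dbsnp.vcf.gz",
    })
    print(text, end="")
    # Print the command instead of executing it, so the sketch is safe to run.
    print(shlex.join(janis_run_command("inputs.yaml")))
```

Building the command as a list keeps it ready for `subprocess.run` without shell quoting concerns, should you choose to execute it.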
Information
ID: | GATK4_GermlineVariantCaller |
---|---|
URL: | No URL to the documentation was provided |
Versions: | 4.0.12.0, 4.1.3.0 |
Authors: | Michael Franklin, Jiaan |
Citations: | |
Created: | 2019-09-01 |
Updated: | 2019-09-13 |
Outputs
name | type | documentation |
---|---|---|
variants | Gzipped<VCF> | |
out_bam | IndexedBam | |
out | VCF | |
Workflow
Embedded Tools
GATK4: SplitReads | Gatk4SplitReads/4.1.3.0 |
GATK4: Haplotype Caller | Gatk4HaplotypeCaller/4.1.3.0 |
UncompressArchive | UncompressArchive/v1.0.0 |
Split Multiple Alleles | SplitMultiAllele/v0.5772 |
Additional configuration (inputs)
name | type | documentation |
---|---|---|
bam | IndexedBam | |
reference | FastaWithIndexes | |
snps_dbsnp | Gzipped<VCF> | |
intervals | Optional<bed> | Optional BED intervals that restrict processing to specific regions. If this input resolves to null, GATK processes the whole genome, as defined by each tool's specification. |
haplotype_caller_pairHmmImplementation | Optional<String> | The PairHMM implementation to use for genotype likelihood calculations. The implementations trade off accuracy against runtime. The --pair-hmm-implementation argument is an enumerated type (Implementation) that accepts one of: EXACT, ORIGINAL, LOGLESS_CACHING, AVX_LOGLESS_CACHING, AVX_LOGLESS_CACHING_OMP, EXPERIMENTAL_FPGA_LOGLESS_CACHING, FASTEST_AVAILABLE. GATK's default is FASTEST_AVAILABLE; this workflow defaults to LOGLESS_CACHING. |
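As an illustration of how the `haplotype_caller_pairHmmImplementation` input resolves, here is a small sketch. The function name `resolve_pair_hmm` is hypothetical and not part of Janis or GATK; the allowed values and the `LOGLESS_CACHING` fallback are taken from the table above and from the workflow definitions on this page.

```python
# Values listed for --pair-hmm-implementation in the inputs table.
PAIR_HMM_IMPLEMENTATIONS = (
    "EXACT",
    "ORIGINAL",
    "LOGLESS_CACHING",
    "AVX_LOGLESS_CACHING",
    "AVX_LOGLESS_CACHING_OMP",
    "EXPERIMENTAL_FPGA_LOGLESS_CACHING",
    "FASTEST_AVAILABLE",
)

# Fallback applied by this workflow when the input is left unset.
WORKFLOW_DEFAULT = "LOGLESS_CACHING"


def resolve_pair_hmm(value=None):
    """Return the implementation to use, falling back to the workflow default."""
    chosen = WORKFLOW_DEFAULT if value is None else value
    if chosen not in PAIR_HMM_IMPLEMENTATIONS:
        allowed = ", ".join(PAIR_HMM_IMPLEMENTATIONS)
        raise ValueError(f"unknown PairHMM implementation {chosen!r}; expected one of: {allowed}")
    return chosen


if __name__ == "__main__":
    print(resolve_pair_hmm())                     # unset input
    print(resolve_pair_hmm("FASTEST_AVAILABLE"))  # explicit choice
```

Checking the value before submitting a run surfaces a typo immediately rather than partway through the workflow.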
Workflow Description Language
version development
import "tools/Gatk4SplitReads_4_1_3_0.wdl" as G
import "tools/Gatk4HaplotypeCaller_4_1_3_0.wdl" as G2
import "tools/UncompressArchive_v1_0_0.wdl" as U
import "tools/SplitMultiAllele_v0_5772.wdl" as S
workflow GATK4_GermlineVariantCaller {
  input {
    File bam
    File bam_bai
    File? intervals
    File reference
    File reference_fai
    File reference_amb
    File reference_ann
    File reference_bwt
    File reference_pac
    File reference_sa
    File reference_dict
    File snps_dbsnp
    File snps_dbsnp_tbi
    String? haplotype_caller_pairHmmImplementation = "LOGLESS_CACHING"
  }
  call G.Gatk4SplitReads as split_bam {
    input:
      bam=bam,
      bam_bai=bam_bai,
      intervals=intervals
  }
  call G2.Gatk4HaplotypeCaller as haplotype_caller {
    input:
      pairHmmImplementation=select_first([haplotype_caller_pairHmmImplementation, "LOGLESS_CACHING"]),
      inputRead=split_bam.out,
      inputRead_bai=split_bam.out_bai,
      reference=reference,
      reference_fai=reference_fai,
      reference_amb=reference_amb,
      reference_ann=reference_ann,
      reference_bwt=reference_bwt,
      reference_pac=reference_pac,
      reference_sa=reference_sa,
      reference_dict=reference_dict,
      dbsnp=snps_dbsnp,
      dbsnp_tbi=snps_dbsnp_tbi,
      intervals=intervals
  }
  call U.UncompressArchive as uncompressvcf {
    input:
      file=haplotype_caller.out
  }
  call S.SplitMultiAllele as splitnormalisevcf {
    input:
      vcf=uncompressvcf.out,
      reference=reference,
      reference_fai=reference_fai,
      reference_amb=reference_amb,
      reference_ann=reference_ann,
      reference_bwt=reference_bwt,
      reference_pac=reference_pac,
      reference_sa=reference_sa,
      reference_dict=reference_dict
  }
  output {
    File variants = haplotype_caller.out
    File variants_tbi = haplotype_caller.out_tbi
    File out_bam = haplotype_caller.bam
    File out_bam_bai = haplotype_caller.bam_bai
    File out = splitnormalisevcf.out
  }
}
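The call graph in the WDL above can be summarised with a short sketch. This is an illustration only, not how Janis or a WDL engine schedules work: the dictionaries are transcribed by hand from the `call` inputs and the `output` block, and the helper names `execution_order` and `steps_needed` are hypothetical. It uses `graphlib` from the Python standard library (3.9 or later).

```python
from graphlib import TopologicalSorter

# Each step mapped to the steps whose outputs it consumes,
# transcribed from the call inputs in the WDL.
STEP_DEPENDENCIES = {
    "split_bam": set(),
    "haplotype_caller": {"split_bam"},
    "uncompressvcf": {"haplotype_caller"},
    "splitnormalisevcf": {"uncompressvcf"},
}

# Each workflow output mapped to the step that produces it.
OUTPUT_SOURCES = {
    "variants": "haplotype_caller",
    "out_bam": "haplotype_caller",
    "out": "splitnormalisevcf",
}


def execution_order(dependencies):
    """Return the steps in an order where every dependency runs first."""
    return list(TopologicalSorter(dependencies).static_order())


def steps_needed(output_name):
    """Return, in execution order, the steps required to produce one output."""
    needed = set()
    stack = [OUTPUT_SOURCES[output_name]]
    while stack:
        step = stack.pop()
        if step not in needed:
            needed.add(step)
            stack.extend(STEP_DEPENDENCIES[step])
    return [step for step in execution_order(STEP_DEPENDENCIES) if step in needed]


if __name__ == "__main__":
    print(execution_order(STEP_DEPENDENCIES))
    for name in OUTPUT_SOURCES:
        print(name, "<-", steps_needed(name))
```

The sketch makes one detail of the workflow visible: `variants` and `out_bam` come straight from HaplotypeCaller, while only `out` passes through the decompression and allele-splitting steps.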
Common Workflow Language
#!/usr/bin/env cwl-runner
class: Workflow
cwlVersion: v1.2
label: GATK4 Germline Variant Caller
doc: |-
  This is a VariantCaller based on the GATK Best Practice pipelines. It uses the GATK4 toolkit, specifically 4.1.3.
  It has the following steps:
  1. Split Bam based on intervals (bed)
  2. HaplotypeCaller
  3. SplitMultiAllele
requirements:
- class: InlineJavascriptRequirement
- class: StepInputExpressionRequirement
inputs:
- id: bam
  type: File
  secondaryFiles:
  - pattern: .bai
- id: intervals
  doc: |-
    This optional interval supports processing by regions. If this input resolves to null, then GATK will process the whole genome per each tool's spec
  type:
  - File
  - 'null'
- id: reference
  type: File
  secondaryFiles:
  - pattern: .fai
  - pattern: .amb
  - pattern: .ann
  - pattern: .bwt
  - pattern: .pac
  - pattern: .sa
  - pattern: ^.dict
- id: snps_dbsnp
  type: File
  secondaryFiles:
  - pattern: .tbi
- id: haplotype_caller_pairHmmImplementation
  doc: |-
    The PairHMM implementation to use for genotype likelihood calculations. The various implementations balance a tradeoff of accuracy and runtime. The --pair-hmm-implementation argument is an enumerated type (Implementation), which can have one of the following values: EXACT;ORIGINAL;LOGLESS_CACHING;AVX_LOGLESS_CACHING;AVX_LOGLESS_CACHING_OMP;EXPERIMENTAL_FPGA_LOGLESS_CACHING;FASTEST_AVAILABLE. Implementation: FASTEST_AVAILABLE
  type: string
  default: LOGLESS_CACHING
outputs:
- id: variants
  type: File
  secondaryFiles:
  - pattern: .tbi
  outputSource: haplotype_caller/out
- id: out_bam
  type: File
  secondaryFiles:
  - pattern: .bai
  outputSource: haplotype_caller/bam
- id: out
  type: File
  outputSource: splitnormalisevcf/out
steps:
- id: split_bam
  label: 'GATK4: SplitReads'
  in:
  - id: bam
    source: bam
  - id: intervals
    source: intervals
  run: tools/Gatk4SplitReads_4_1_3_0.cwl
  out:
  - id: out
- id: haplotype_caller
  label: 'GATK4: Haplotype Caller'
  in:
  - id: pairHmmImplementation
    source: haplotype_caller_pairHmmImplementation
  - id: inputRead
    source: split_bam/out
  - id: reference
    source: reference
  - id: dbsnp
    source: snps_dbsnp
  - id: intervals
    source: intervals
  run: tools/Gatk4HaplotypeCaller_4_1_3_0.cwl
  out:
  - id: out
  - id: bam
- id: uncompressvcf
  label: UncompressArchive
  in:
  - id: file
    source: haplotype_caller/out
  run: tools/UncompressArchive_v1_0_0.cwl
  out:
  - id: out
- id: splitnormalisevcf
  label: Split Multiple Alleles
  in:
  - id: vcf
    source: uncompressvcf/out
  - id: reference
    source: reference
  run: tools/SplitMultiAllele_v0_5772.cwl
  out:
  - id: out
id: GATK4_GermlineVariantCaller
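The `secondaryFiles` patterns in the CWL above determine which index files must sit next to each input. The sketch below shows how such a pattern resolves to a path, following the CWL rule that each leading `^` removes one file extension before the rest of the pattern is appended. The helper names are hypothetical, and the pattern table is copied by hand from the `inputs` section above.

```python
import posixpath

# secondaryFiles patterns for each required input, from the CWL inputs section.
SECONDARY_PATTERNS = {
    "bam": [".bai"],
    "reference": [".fai", ".amb", ".ann", ".bwt", ".pac", ".sa", "^.dict"],
    "snps_dbsnp": [".tbi"],
}


def secondary_file_path(primary, pattern):
    """Resolve one secondaryFiles pattern against a primary file path."""
    path = primary
    # Each leading caret strips one extension from the primary path.
    while pattern.startswith("^"):
        pattern = pattern[1:]
        path, _ext = posixpath.splitext(path)
    # The remainder of the pattern is appended as a suffix.
    return path + pattern


def expected_files(inputs):
    """List every primary and secondary file implied by the given inputs."""
    files = []
    for name, primary in inputs.items():
        files.append(primary)
        for pattern in SECONDARY_PATTERNS.get(name, []):
            files.append(secondary_file_path(primary, pattern))
    return files


if __name__ == "__main__":
    for path in expected_files({
        "bam": "bam.bam",
        "reference": "reference.fasta",
        "snps_dbsnp": "snps_dbsnp.vcf.gz",
    }):
        print(path)
```

Note the difference between `.fai`, which yields `reference.fasta.fai`, and `^.dict`, which yields `reference.dict`. Running a check like this against the filesystem before `janis run` is one way to confirm that all reference files are available.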